In microbiome studies, differential abundance analysis (DAA) is an active area of research. Over 12 years ago, the term "differential abundance" was coined as a direct analogy to differential expression from RNA-seq, and it has since been adopted in the microbiome literature (McMurdie and Holmes 2014; White et al. 2009; Paulson et al. 2013). The main goal of DAA is to identify features (i.e., OTUs, ASVs, taxa, species, etc.) that are differentially abundant across sample groups (i.e., features present at different abundances across sample groups).
Microbiome data are sparse, with many zeros; to appropriately identify differentially abundant features, models used to analyze microbiome data should be able to address overdispersion, zero inflation, and sparsity. To address overdispersion and zero inflation while identifying the microbial taxa that are associated with covariates, several statistical models specifically designed for microbiome data have been proposed. In this chapter, we first introduce the zero-inflated Gaussian (ZIG) and zero-inflated log-normal (ZILN) mixture models (Sect. 12.1). Then we illustrate the ZILN via the metagenomeSeq package (Sect. 12.2). Sections 12.3 and 12.4 introduce and illustrate some additional statistical tests and some useful functions in metagenomeSeq, respectively. Next, we provide some remarks on CSS, ZIG, and ZILN in Sect. 12.5. Finally, we complete this chapter with a brief summary in Sect. 12.6.
12.1 Zero-Inflated Gaussian (ZIG) and Zero-Inflated Log-Normal (ZILN) Mixture Models
Both the ZIG and ZILN mixture models are implemented in the metagenomeSeq package (Paulson 2020), via the functions fitZig() and fitFeatureModel(), respectively. The function fitZig() relies heavily on the limma package (Smyth 2005). The ZIG mixture model was proposed first, in 2013 (Paulson et al. 2013); it estimates the probability that a zero for a particular feature in a sample is a technical zero. The model was later reparametrized as a zero-inflated log-normal model, which fitFeatureModel() fits for each OTU separately. Currently, the ZILN model, along with fitFeatureModel(), is recommended (Paulson 2020).
The ZIG mixture model was motivated by the observation that the depth of coverage in a sample is directly related to how many features are detected in that sample, a relationship between depth of coverage and OTU identification that is ubiquitous in marker-gene survey datasets. Thus, the ZIG mixture model uses a probability mass to accommodate the excess zero counts (both sampling zeros and structural zeros) and a Gaussian distribution to model the mean group abundance (the nonzero observed counts) for each taxonomic feature, explicitly accounting for biases in DAA that result from under-sampling of the microbial community.
The model directly estimates the probability that an observed zero is generated from the detection distribution owing to under-sampling (a sampling zero) or from the count distribution owing to the absence of the taxonomic feature from the microbial community (a structural zero). An expectation-maximization algorithm is used to estimate the expected value of the latent component indicators based on each sample's sequencing depth of coverage.
The model is implemented in the metagenomeSeq package, which mainly provides two methods: the cumulative sum scaling (CSS) normalization and the zero-inflated Gaussian (ZIG) mixture model.
The ZIG mixture model was designed to use the CSS normalization technique to correct the bias in the assessment of differential abundance (DA) introduced by total sum (TSS) normalization.
12.1.1 Total Sum Scaling (TSS)
In the RNA-seq and microbiome literature, TSS (Paulson et al. 2013; Chen et al. 2018) and proportion (McMurdie and Holmes 2014; Weiss et al. 2017) are referred to as the same method. TSS uses the total read count of each sample as the size factor to estimate the library size and scale the count matrix: it normalizes count data by dividing taxon or OTU read counts by the total number of reads in each sample, converting counts to proportions. However, TSS has several recognized limitations:

1. TSS is not robust to outliers (Chen et al. 2018), is inefficient in addressing heteroscedasticity, is unable to address overdispersion, and results in a high rate of false positives in tests for differentially abundant species. This systematic bias increases the type I error rate even after correcting for multiple testing (McMurdie and Holmes 2014).
2. TSS relies on the constant-sum constraint; hence it cannot remove compositionality and instead is prone to creating compositional effects, making nondifferential taxa or OTUs appear to be differential (Chen et al. 2018; Mandal et al. 2015; Tsilimigras and Fodor 2016; Morton et al. 2017).
3. TSS performs poorly in terms of both FDR control and statistical power at the same false-positive rate, compared with other scaling or size-factor-based methods such as GMPR and RLE, because of strong compositional effects (Chen et al. 2018).
4. Like rarefying, TSS (or proportion) is built on the assumption that the individual gene or taxon count in each sample was randomly sampled from the reads in multiple samples; under this assumption, counts can be divided by the total library size to convert them to proportions, and gene expression or taxon abundance analysis can be fitted via a Poisson distribution. Because this assumption is not met in either RNA-seq or 16S rRNA-seq data, the sample proportion approach is inappropriate for the detection of differentially abundant species (McMurdie and Holmes 2014) and should not be used for most statistical analyses (Weiss et al. 2017).
12.1.2 Cumulative Sum Scaling (CSS)
CSS normalization (Paulson et al. 2013) is a normalization method specifically designed for microbiome data. CSS aims to correct the bias in the assessment of differential abundance introduced by total sum normalization. The CSS method divides raw counts by the cumulative sum of counts up to a percentile determined using a data-driven approach, in order to capture the relatively invariant part of the count distribution for a dataset. In other words, the choice of percentile is driven by empirical rather than theoretical considerations.
Let $q_j^l$ denote the $l$th quantile of sample $j$ (i.e., in sample $j$ there are $l$ taxonomic features with counts smaller than $q_j^l$). For $l = \lfloor 0.95m \rfloor$, $q_j^l$ corresponds to the 95th percentile of the count distribution for sample $j$. Also denote $s_j^l = \sum_{(i \mid c_{ij} \le q_j^l)} c_{ij}$, the sum of counts for sample $j$ up to the $l$th quantile; with this notation, the total sum is $s_j = s_j^m$. The CSS normalization method chooses a value $\hat{l} \le m$ and uses $s_j^{\hat{l}}$ as a normalization scaling factor for each sample to produce normalized counts $\tilde{c}_{ij} = \frac{c_{ij}}{s_j^{\hat{l}}} N$, where $N$ is an appropriately chosen normalization constant.
In practice, the same constant $N$ is used to scale all samples so that normalized counts have interpretable units; the median of the scaling factors $s_j^{\hat{l}}$ across samples is recommended as $N$. In effect, CSS treats the sample-specific count distributions up to the chosen quantile as reference distributions and interprets counts for other samples relative to that reference.
CSS requires the threshold to be set at the 50th percentile or higher. CSS then calculates the normalization factor for each sample as the sum of the taxon or OTU counts up to that threshold. In this way, the CSS method optimizes the threshold from the data to minimize the influence of variable high-abundance taxa or OTUs: to mitigate the influence of large count values, CSS scales only the portion of each sample's count distribution that is relatively invariant across samples. CSS normalization can be implemented using the metagenomeSeq package; by default, the 50th percentile is set as the threshold.
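To make the scaling-factor calculation concrete, below is a minimal base-R sketch of the idea; the toy function css_factor() is ours for illustration only, and the actual cumNorm() implementation chooses the percentile adaptively and differs in detail:

```r
# Toy illustration of a CSS scaling factor for a single sample:
# sum the counts that fall at or below the chosen percentile (default 50th).
css_factor <- function(counts_j, p = 0.5) {
  q <- quantile(counts_j[counts_j > 0], probs = p)  # p-th quantile of nonzero counts
  sum(counts_j[counts_j <= q])                      # cumulative sum up to that quantile
}

counts_j <- c(0, 0, 3, 10, 12, 250, 900)  # made-up counts for one sample
sj_hat <- css_factor(counts_j)            # sample-specific scaling factor
N <- 1000                                 # normalization constant (e.g., the median factor)
normalized_j <- counts_j / sj_hat * N
```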
12.1.3 ZIG and ZILN Models
The ZIG mixture model assumes that CSS-normalized sample abundance measurements are approximately log-normally distributed given large numbers of samples. Thus, in ZIG the normalized count data are first log-transformed to control the variability of taxonomic feature measurements across samples.
Let $n_A$ and $n_B$ denote the numbers of samples in groups A and B from two populations, respectively, with $m$ features (OTUs) measured in each sample; let $c_{ij}$ denote the raw count for feature $i$ in sample $j$, and let $k(j) = I\{j \in \text{group } A\}$ be the class indicator function. The continuity-corrected $\log_2$ of the raw count data, $y_{ij} = \log_2(c_{ij} + 1)$, is then modeled by the zero-inflated model as a mixture of a point mass at zero, $I_{\{0\}}(y)$, and a count distribution $f_{count}(y; \mu, \sigma^2) \sim N(\mu, \sigma^2)$. Given mixture parameters $\pi_j$, the density of the zero-inflated Gaussian distribution for feature $i$ in sample $j$ with $s_j$ total counts is given as

$$f_{zig}\left(y_{ij}; s_j, \beta, \mu_i, \sigma_i^2\right) = \pi_j(s_j) \cdot I_{\{0\}}(y_{ij}) + \left(1 - \pi_j(s_j)\right) \cdot f_{count}\left(y_{ij}; \mu_i, \sigma_i^2\right)$$

with the mean of the count distribution modeled as

$$E\left(y_{ij} \mid \Delta_{ij} = 0\right) = b_{i0} + b_{i1}\,k(j) + \eta_i \log_2\!\left(\frac{s_j^{\hat{l}}}{N} + 1\right)$$

The $\eta_i \log_2\left(s_j^{\hat{l}}/N + 1\right)$ term is used to capture OTU-specific normalization factors (e.g., feature-specific biases in PCR amplification) through the parameter $\eta_i$. The ZIG model can also be specified without OTU-specific normalization by including an offset term that serves as the normalization factor, as in a linear model. To model the dependency of the number of zero-valued features of a sample on its total number of counts, the mixture parameter $\pi_j(s_j)$ is modeled as a binomial process:

$$\log\frac{\pi_j}{1 - \pi_j} = \beta_0 + \beta_1 \log_2(s_j)$$
The ZIG model was developed under the framework of linear modeling and standard conventions of DAA methods in gene expression; thus it has the advantage of controlling for confounding factors. Besides the group membership variable, other clinical covariates can be incorporated into the ZIG model to detect confounding effects, and appropriate covariates can also be included to capture variability in the sampling process. The full set of estimates is denoted as $\theta_{ij} = \{\beta_0, \beta_1, b_{i0}, \eta_i, b_{i1}\}$. The mixture membership $\Delta_{ij}$ is treated as a latent indicator variable, with $\Delta_{ij} = 1$ if $y_{ij}$ is generated from the zero point mass. The log-likelihood in this extended model is then given as

$$l\left(\theta_{ij}; y_{ij}, s_j\right) = \left(1 - \Delta_{ij}\right) \log f_{count}\left(y_{ij}; \mu_i, \sigma_i^2\right) + \Delta_{ij} \log \pi_j(s_j) + \left(1 - \Delta_{ij}\right) \log\left(1 - \pi_j(s_j)\right)$$
In this formulation, the extended model is a mixture of a point mass at zero and a normal distribution: the point-mass component models features (OTUs/taxa) that are sparse and have many zero counts, and the normal component models the log-transformed counts of the mean group abundance (the nonzero observed counts). The maximum likelihood estimates are approximated using the expectation-maximization algorithm. The function fitZig() estimates the probability that a zero for a particular feature in a sample is a technical zero.
A moderated t-statistic is constructed to estimate the fold change ($b_{1i}$) (only for the count component of the ZIG mixture model) and its standard error by empirical Bayes shrinkage of parameter estimates (Smyth 2005). The P-values are obtained by testing the null hypothesis $b_{1i} = 0$ using a parametric t-distribution. The zero-inflated log-normal model has been wrapped into fitFeatureModel() to fit a zero-inflated model for each specific feature (OTU/taxon) separately. Currently, fitFeatureModel() is applicable only to the binary covariate case. The moderated t-statistic is defined as below:

$$\tilde{t}_i = \frac{\hat{b}_{1i}}{\tilde{s}_i \sqrt{v_i}}, \qquad \tilde{s}_i^2 = \frac{d_0 s_0^2 + d_i s_i^2}{d_0 + d_i}$$

where $v_i$ is the unscaled variance factor of $\hat{b}_{1i}$; the posterior variance $\tilde{s}_i^2$ is obtained by pooling all features' variances as described in Smyth (2005); $s_i^2$ and $d_i$ are the observed feature variance and degrees of freedom, respectively; and $d_0$ and $s_0^2$ are estimated using the method of moments incorporating all feature variances and degrees of freedom, respectively. The multiple testing of features is corrected using the Q-value method (Storey and Tibshirani 2003). The choice of a log-normal distribution for the count component of the mixture, rather than a generalized linear model (e.g., negative binomial), in forming the fitFeatureModel() function was made for computational efficiency and numerical stability, as well as for the appropriateness of the log-normal distribution in marker-gene survey studies with moderate to large sample sizes (Soneson and Delorenzi 2013; Paulson et al. 2013).

12.2 Implement ZILN via metagenomeSeq
The metagenomeSeq package was designed to detect features (i.e., OTUs, genera, species, etc.) that are differentially abundant between two or more groups of multiple samples. The software implements the proposed CSS normalization technique and the ZIG mixture model to account for sparsity due to under-sampling of microbial communities in disease-association detection and in testing feature correlations (Paulson et al. 2013). The metagenomeSeq package thus addresses two tasks: (1) controlling for biases in measurements across taxonomic features via CSS and (2) accounting for varying depths of coverage via ZIG.
To facilitate modeling covariate effects or confounding sources of variability and interpreting results, the metagenomeSeq package uses a linear model framework and provides several visualization functions for examining discoveries. Generally, implementing metagenomeSeq takes six steps: (1) load the OTU table, taxonomy data, and metadata; (2) create a MRexperiment object; (3) normalize the data via the MRexperiment object; (4) perform statistical testing of abundance or presence-absence; (5) visualize and save the analysis results; and (6) visualize features. We illustrate how to implement this workflow using metagenomeSeq step by step below.
We introduced this dataset in Example 9.1 (Zhang et al. 2020) and used it to illustrate the calculation and statistical hypothesis testing of alpha and beta diversities in Chaps. 9, 10, and 11. Here we continue to use this dataset to illustrate the metagenomeSeq package.
Step 1: Load OTU table, annotated taxonomy data, and metadata.
| | sample1 | sample2 | … | samplen |
|---|---|---|---|---|
| feature1 | c11 | c12 | … | c1n |
| feature2 | c21 | c22 | … | c2n |
| ⋮ | ⋮ | ⋮ | ⋱ | ⋮ |
| featurem | cm1 | cm2 | … | cmn |
The above table presents m features in n samples, where the elements in a count matrix C (m, n), cij, are the number of reads annotated for a particular feature i (i.e., OTU, genus, species, etc.) in sample j.
To create a MRexperiment object, we need to load the count data, taxonomy data, and metadata. First, we set up the working directory and install the metagenomeSeq package (version 1.36.0, April 10, 2022):
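A minimal sketch of the setup; the working directory path is hypothetical:

```r
# Set the working directory (hypothetical path) and install metagenomeSeq
setwd("~/microbiome_analysis")
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("metagenomeSeq")
library(metagenomeSeq)
```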
The metagenomeSeq package requires a feature table, which is a count matrix with features along the rows and samples along the columns. We can use the loadMeta() function to load the count data from a tab-delimited OTU matrix. The loadMeta() function loads the taxa (OTUs) and counts into a list:
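A sketch, assuming the OTU count matrix is stored in a tab-delimited file named otu_table.txt (the file name is hypothetical):

```r
# loadMeta() returns a list with two elements: $counts and $taxa
counts <- loadMeta("otu_table.txt")
dim(counts$counts)  # features x samples
```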
Now we load the taxonomy table. The row order of OTUs in the taxonomy table (a matrix in tab-delimited format) must match the row order of OTUs in the OTU table:
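For example (the file name taxonomy.txt is hypothetical):

```r
# Read the tab-delimited taxonomy table; its row order must match the OTU table
taxa <- read.delim("taxonomy.txt", stringsAsFactors = FALSE)
head(taxa)
```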
In the read.delim() function above, specifying the argument stringsAsFactors = FALSE is important. By default, stringsAsFactors is set to TRUE in read.delim() and other import functions, including read.table(), read.csv(), and read.csv2(), so columns containing character strings are imported into R as factors. The rationale for this default is to tell R to treat categorical variables as individual dummy variables in modeling functions like lm() and glm(). However, in microbiome or genomics studies it does not make sense to encode the names of taxa or genes as factors: they are essentially just labels and will not be used in any modeling function. So we need to set the stringsAsFactors argument to FALSE to change the default.
Phenotype data can optionally be loaded into R with loadPhenoData(), which loads the data as a list. Now we use the loadPhenoData() function to load the metadata into R:
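A sketch, assuming the metadata are stored in a file named metadata.txt (hypothetical):

```r
# loadPhenoData() loads the sample metadata;
# tran = TRUE transposes the data so that samples are along the rows
clin <- loadPhenoData("metadata.txt", tran = TRUE)
# Match the metadata rows to the column (sample) order of the count matrix
ord <- match(colnames(counts$counts), rownames(clin))
clin <- clin[ord, ]
head(clin)
```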
Step 2: Create MRexperiment object.

The MRexperiment object is built on the Biobase eSet class, whose main slots are the following:
assayData: contains matrices with equal dimensions and with column number equal to nrow(phenoData), which is an AssayData-class.
phenoData: contains experimenter-supplied variables describing sample (i.e., columns in assayData) phenotypes, which is an AnnotatedDataFrame-class.
featureData: contains variables describing features (i.e., rows in assayData) unique to this experiment, which is an AnnotatedDataFrame-class.
experimentData: contains details of experimental methods, which is a MIAME-class.
annotation: label associated with the annotation package used in the experiment, which is character.
protocolData: contains microarray equipment-generated variables describing sample (i.e., columns in assayData) phenotypes, which is an AnnotatedDataFrame-class.
The MRexperiment object uses a simple extension of eSet called expSummary (the S4 class with adding a single experiment summary slot, a data frame) to include the depth of coverage and the normalization factors for each sample. In MRexperiment object, the three main slots of assayData, phenoData, and featureData are most useful. All matrices are organized in the assayData slot, all phenotype data is stored in phenoData, and feature data (OTUs, taxonomic assignment to varying levels, etc.) is stored in featureData. Additional slots are available for reproducibility and annotation.
A MRexperiment object will be created by the function newMRexperiment(). This function takes a count matrix, phenoData (annotated data frame), and featureData (annotated data frame) as input. To create a MRexperiment object, call the newMRexperiment() and pass the counts, phenotype, and feature data as parameters. After the datasets are formatted as MRexperiment objects, the analysis is relatively easy using the metagenomeSeq package.
Below we create a MRexperiment object named "MRobj". First, we convert both the phenotype data and the feature data into AnnotatedDataFrame objects using the AnnotatedDataFrame() function:
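A sketch putting the pieces together, using the objects created in Step 1:

```r
# Wrap the phenotype and feature data as AnnotatedDataFrame objects
phenotypeData <- AnnotatedDataFrame(clin)
OTUdata <- AnnotatedDataFrame(taxa)
# Create the MRexperiment object from the counts, phenotype, and feature data
MRobj <- newMRexperiment(counts$counts, phenoData = phenotypeData,
                         featureData = OTUdata)
MRobj
```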
Step 3: Normalize data via MRexperiment objects.
After the MRexperiment object has been created, we can use it to normalize the data, perform statistical tests (abundance or presence-absence), and visualize or save the analysis results.
- (i)
Calculate normalization factors using the function cumNorm().
First, check if the samples were normalized.
We can use the normFactors() function to access the normalization factors (aka the scaling factors) of samples in a MRexperiment object:
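For example:

```r
# Access the scaling factors of the samples
head(normFactors(MRobj))
```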
Second, calculate the proper percentile using cumNormStat() and cumNormStatFast().
To calculate the proper percentile up to which to sum counts and scale by, we can use either cumNormStat() or cumNormStatFast().
The cumNormStatFast() function is faster than the cumNormStat() function.
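A sketch:

```r
# Data-driven percentile up to which counts are summed and scaled
p <- cumNormStatFast(MRobj)
p
```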
Third, calculate the scaling factors using cumNorm().
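A sketch using the percentile p computed above:

```r
# Compute each sample's scaling factor (cumulative sum of counts up to p)
MRobj <- cumNorm(MRobj, p = p)
head(normFactors(MRobj))
```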
- (ii)
Calculate normalization factors using the function wrenchNorm().
An alternative to calculating normalization factors with cumNorm is the wrenchNorm method. The two functions behave similarly; the difference is that cumNorm takes the percentile as an argument, while wrenchNorm takes the condition, a factor whose values separate samples into the phenotypic groups of interest (in this case, Group/Genotype). The package authors state that, when used appropriately, wrench normalization is preferable to cumulative sum scaling:
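A sketch, assuming the grouping variable is stored in a pheno-data column named Group (the column name is an assumption):

```r
# Wrench normalization uses the phenotypic condition instead of a percentile
condition <- pData(MRobj)$Group  # assumed column name
MRobj_wrench <- wrenchNorm(MRobj, condition = condition)
```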
Step 4: Perform statistical testing of abundance.
The differential abundance analysis of features (e.g., OTUs) can be fitted using the functions fitFeatureModel() and fitZig(). The fitFeatureModel() function is a later development that reparametrizes the authors' original zero-inflation model to fit a zero-inflated log-normal mixture model for each feature (OTU) separately. fitFeatureModel() is recommended over fitZig() in the metagenomeSeq package. Its main arguments are the following:
obj is a MRexperiment object with count data.
mod is used to specify the model for the count distribution.
coef is the coefficient of interest to grab log fold changes.
B specifies the number of bootstraps to perform. If B is greater than 1, a permutation test is performed.
szero (TRUE/FALSE) is used to shrink zero component parameters.
spos (TRUE/FALSE) is used to shrink positive component parameters.
fitFeatureModel() returns a list of objects, including call (the call made to fitFeatureModel), fitZeroLogNormal (a list of parameter estimates for the zero-inflated log-normal model), design (model matrix), taxa (taxa names), counts (count matrix), pvalues (calculated P-values), and permutedfits (permuted z-score estimates under the null). Useful summary tables can be obtained from the fitted model via MRcoefs(), MRtable(), and MRfulltable().
Here, we use fitFeatureModel() to conduct hypothesis testing of differential abundance in the mouse microbiome, comparing the genotype/group factors. This study had a longitudinal design with measurements before and after treatment. To illustrate the model, here we use the post-treatment data.
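A sketch of the model fit, assuming pheno-data columns named Time (with level "Post") and Group; all of these names are assumptions:

```r
# Subset to the post-treatment samples and normalize them
obj_Post <- MRobj[, which(pData(MRobj)$Time == "Post")]  # assumed column/level
obj_Post <- cumNorm(obj_Post, p = cumNormStatFast(obj_Post))
# Fit the zero-inflated log-normal model for each feature
mod <- model.matrix(~ Group, data = pData(obj_Post))     # assumed column
fit <- fitFeatureModel(obj_Post, mod)
```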
Step 5: Save and explore analysis results.
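A sketch of extracting and saving the top results from the fitted model (the output file name is arbitrary):

```r
# Summary tables of the most differentially abundant OTUs
head(MRcoefs(fit))
res <- MRfulltable(fit, number = 10)
write.csv(res, file = "fitFeatureModel_post.csv")
```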
Modeling results using fitFeatureModel based on post-QtRNA mouse microbiome data
| | +samples in group 0 | +samples in group 1 | counts in group 0 | counts in group 1 | oddsRatio | lower | upper | fisherP | fisherAdjP | logFC | se | pvalues | adjPvalues |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OTU_126 | 2 | 10 | 3 | 10052 | 0 | 0 | 0.261985148691248 | 0.000714455822814956 | 0.0147177899499881 | 7.7572762392443 | 0.782297610849027 | 0 | 0 |
| OTU_116 | 3 | 10 | 10 | 8130 | 0 | 0 | 0.388279386040879 | 0.00309597523219815 | 0.0631475748194016 | 6.63063439517916 | 0.735958182613116 | 0 | 0 |
| OTU_101 | 3 | 6 | 3 | 3058 | 0.305415044366772 | 0.030053635388153 | 2.46429182813556 | 0.369849964277209 | 1 | 5.390903521682831 | 1.24753104634402 | 2.17350999687227e-06 | 2.79839412097305e-05 |
| OTU_125 | 1 | 10 | 2 | 3241 | 0 | 0 | 0.166690753277639 | 0.000119075970469159 | 0.00306620623958085 | 5.39137800592425 | 0.188856285124351 | 0 | 0 |
| OTU_558 | 2 | 6 | 4 | 1863 | 0.184118110070328 | 0.0125264745480946 | 1.65925396045797 | 0.169802333889021 | 1 | 4.66860566955941 | 1.71741594144821 | 0.00656005395386594 | 0.0337842778624096 |
| OTU_108 | 1 | 6 | 2 | 27889 | 0.0859350429538673 | 0.00146108353677648 | 1.05397598911988 | 0.0572755417956657 | 0.453798523457967 | 4.61985047754296 | 1.35624280574321 | 0.000658354420500995 | 0.00565087544263354 |
| OTU_408 | 10 | 1 | 1044 | 1 | Inf | 5.9991330073025 | Inf | 0.000119075970469159 | 0.00306620623958085 | -4.32710318121278 | 0.33177493173258 | 0 | 0 |
| OTU_323 | 0 | 10 | 0 | 527 | 0 | 0 | 0.0897995510045625 | 1.0825088224469e-05 | 0.00111498408712031 | 3.39310932393747 | 0.388662351098798 | 0 | 0 |
| OTU_97 | 1 | 6 | 3 | 4689 | 0.0859350429538673 | 0.00146108353677648 | 1.05397598911988 | 0.0572755417956657 | 0.453798523457967 | 3.18079174547734 | 1.10745461114788 | 0.0040767078299202 | 0.02210000477095674 |
| OTU_367 | 1 | 6 | 1 | 850 | 0.0859350429538673 | 0.00146108353677648 | 1.05397598911988 | 0.0572755417956657 | 0.453798523457967 | 2.67579145567032 | 0.78377736855131 | 0.000640239150733191 | 0.00565087544263354 |
Step 6: Visualize Features.
The metagenomeSeq package provides several plotting functions to help with the visualization and analysis of datasets.
Step 6a: plotMRheatmap – heatmap of abundance estimates (Fig. 12.1).
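A sketch of the plotting call, assuming the genotype labels live in the pData column Group; extra arguments are passed through to gplots::heatmap.2():

```r
library(gplots)
library(RColorBrewer)
trials <- pData(obj_Post)$Group  # assumed column name
heatmapColColors <- brewer.pal(12, "Set3")[as.integer(factor(trials))]
heatmapCols <- colorRampPalette(brewer.pal(9, "RdBu"))(50)
# Heatmap of the 100 OTUs with the largest overall variance
plotMRheatmap(obj = obj_Post, n = 100, cexRow = 0.4, cexCol = 0.4,
              trace = "none", col = heatmapCols,
              ColSideColors = heatmapColColors)
```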

Fig. 12.1 Heatmap of abundance estimates based on post-QtRNA mouse microbiome data
The above commands create a heatmap and hierarchical clustering of log2 transformed counts for the 100 OTUs with the largest overall variance. Red values indicate counts close to zero. Row color labels indicate OTU taxonomic class; column color labels indicate genotype (yellow = KO, green = WT). Notice that the samples are obviously clustered by genotypes.
Step 6b: plotCorr – heatmap of pairwise correlations (Fig. 12.2).
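A sketch of the plotting call, reusing the color palette defined above:

```r
# Heatmap of pairwise correlations among the 200 most variable OTUs
plotCorr(obj = obj_Post, n = 200, cexRow = 0.25, cexCol = 0.25,
         trace = "none", dendrogram = "none", col = heatmapCols)
```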

Fig. 12.2 Heatmap of pairwise correlations based on post-QtRNA mouse microbiome data
Step 6c: plotOrd – PCA/MDS components (Fig. 12.3).
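A sketch, again assuming the Group pheno-data column:

```r
# Classical MDS of the samples; points colored by group
cl <- factor(pData(obj_Post)$Group)  # assumed column name
plotOrd(obj_Post, tran = TRUE, usePCA = FALSE, useDist = TRUE,
        bg = cl, pch = 21)
```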

Fig. 12.3 Ordination plot with MDS components based on post-QtRNA mouse microbiome data
We can also load the vegan package, set distfun = vegdist, and use dist.method = "bray" to generate the ordination using the Bray-Curtis dissimilarity:
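For example:

```r
library(vegan)
# Ordination based on Bray-Curtis dissimilarities
plotOrd(obj_Post, tran = TRUE, usePCA = FALSE, useDist = TRUE,
        distfun = vegdist, dist.method = "bray", bg = cl, pch = 21)
```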
12.3 Some Additional Statistical Tests in metagenomeSeq
Below we will illustrate some useful tests in the metagenomeSeq package.
12.3.1 Log-Normal Permutation Test of Abundance
We can fit the same model as above using a standard log-normal linear model that provides permutation-based P-values. This method was originally used by Metastats (White et al. 2009; Paulson et al. 2011) for non-sparse large samples. The fitLogNormal() function is a wrapper that performs the permutation test on the t-statistic. There are options to use CSS normalization and a log2 transformation of the data. Here, we use 1000 permutations, which provides P-value resolution to the thousandth. We set the coefficient parameter to 2 to test the coefficient of interest (here, Group). A list is generated to hold the significant features:
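A sketch adapted from the package vignette, with obj_Post and mod as defined earlier:

```r
coeffOfInterest <- 2
# Permutation-based log-normal model with B = 1000 permutations
res <- fitLogNormal(obj = obj_Post, mod = mod, useCSSoffset = FALSE,
                    B = 1000, coef = coeffOfInterest)
# Adjust the permutation p-values for multiple testing
adjustedPvalues <- p.adjust(res$p, method = "fdr")
# Absolute fold-change estimates for the coefficient of interest
foldChange <- abs(res$fit$coef[, coeffOfInterest])
# Keep significant features, ordered by fold change
sigList <- which(adjustedPvalues <= 0.05)
sigList <- sigList[order(foldChange[sigList])]
head(sigList)
```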
12.3.2 Presence-Absence Testing of the Proportion/Odds
The presence-absence test addresses the hypothesis that the proportion/odds of a given feature being present is higher or lower in one group of individuals than in another. The fitPA() function performs Fisher's exact test on a 2 × 2 contingency table and calculates P-values, odds ratios, and confidence intervals. The function accepts either a MRexperiment object or a matrix as input:
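A sketch, assuming the Group column provides the class labels:

```r
classes <- pData(obj_Post)$Group  # assumed column name
# Fisher's exact test of presence/absence for each feature
res <- fitPA(obj_Post, cl = classes)
head(res)
```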
12.3.3 Discovery Odds Ratio Testing of the Proportion of Observed Counts
The discovery odds ratio test addresses the hypothesis that the proportion of observed counts for a feature, relative to all counts, is comparable between groups. The fitDO() function performs Fisher's exact test on a 2 × 2 contingency table and calculates P-values, odds ratios, and confidence intervals. The function accepts either a MRexperiment object or a matrix as input:
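A sketch, reusing the class labels defined above:

```r
# Discovery odds ratio test on the raw (unnormalized, untransformed) counts
res <- fitDO(obj_Post, cl = classes, norm = FALSE, log = FALSE)
head(res)
```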
12.3.4 Perform Feature Correlations
We can use correlationTest() and correctIndices() to conduct pairwise tests of the correlations of abundance features or samples for each row of a matrix or MRexperiment object. The correlationTest() function returns the basic Pearson, Spearman, or Kendall correlation statistics for the rows of the input, together with the associated P-values. Given a vector of length ncol(obj), the function also calculates the correlation of each row with that vector. However, we should be cautious when inferring correlation networks from genomic survey data (Friedman and Alm 2012):
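A sketch on a handful of features:

```r
# Pairwise correlations (and p-values) among the first five features
cors <- correlationTest(obj_Post[1:5, ], norm = FALSE, log = FALSE)
head(cors)
```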
12.4 Illustrate Some Useful Functions in metagenomeSeq
To facilitate analysis using the metagenomeSeq package, we illustrate some useful functions.
12.4.1 Access the MRexperiment Object
Three pairs of functions are used to access the raw or normalized count matrix, feature information, and phenotype information, respectively. The MRcounts() function is used for accessing the raw or normalized count matrix, the featureData() and fData() functions are used for accessing feature information, and the phenoData() and pData() functions are used for accessing phenotype information.
We use the MRcounts() function to access the raw count matrix:
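For example:

```r
# Raw counts for the first two samples
head(MRcounts(MRobj[, 1:2]))
```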
The feature information can be accessed with the featureData() and fData() functions. The difference is that featureData() returns the S4 AnnotatedDataFrame object, while fData() returns the taxonomy table as a data frame:
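For example:

```r
featureData(MRobj)  # S4 AnnotatedDataFrame summary
head(fData(MRobj))  # taxonomy table as a data frame
```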
The phenotype information can be accessed with the phenoData() and pData() functions. The difference is that phenoData() returns the S4 AnnotatedDataFrame object, while pData() returns the metadata table as a data frame:
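For example:

```r
phenoData(MRobj)  # S4 AnnotatedDataFrame summary
head(pData(MRobj))  # metadata table as a data frame
```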
12.4.2 Subset the MRexperiment Object
We can subset a MRexperiment-class object. The following commands subset the object "MRobj" to features with row sums greater than 100 and to the before-treatment samples:
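A sketch, assuming a pheno-data column Time with level "Before" (both names assumed):

```r
featuresToKeep <- which(rowSums(MRcounts(MRobj)) > 100)
samplesToKeep <- which(pData(MRobj)$Time == "Before")  # assumed column/level
obj_Before <- MRobj[featuresToKeep, samplesToKeep]
obj_Before
```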
The output shows that the subsetted object contains 20 before-treatment samples.
12.4.3 Filter the MRexperiment Object or Count Matrix
We can use the filterData() function to filter the data to maintain a threshold of minimum depth or OTU presence. The following commands keep features present in at least one sample after filtering samples to a depth of coverage of at least 100. This function can be applied to a MRexperiment object or a count matrix:
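For example:

```r
# Keep features present in >= 1 sample among samples with depth >= 100
obj_filtered <- filterData(MRobj, present = 1, depth = 100)
obj_filtered
```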
12.4.4 Merge the MRexperiment Object
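Two MRexperiment objects can be combined with the mergeMRexperiments() function. A minimal sketch, merging an object with itself purely for illustration (as in the package vignette):

```r
# Merge two MRexperiment objects; duplicated sample names are made unique
merged <- mergeMRexperiments(MRobj, MRobj)
merged
```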
12.4.5 Call the Normalized Counts Using the cumNormMat() and MRcounts()
Below we use the cumNormMat() function to return the normalized matrix, using the percentile generated by the cumNormStatFast() function and scaling by 1000:
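A sketch:

```r
# Normalized count matrix, scaled by sl = 1000
normMat <- cumNormMat(MRobj, p = cumNormStatFast(MRobj), sl = 1000)
normMat[1:5, 1:5]  # a corner of the matrix, for illustration
```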
Below we use the MRcounts() function to access the counts slot of a MRexperiment object. The counts slot holds the raw count data, representing the number of reads annotated for a particular feature (along the rows) in each sample (along the columns):
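For example, raw versus CSS-normalized counts:

```r
head(MRcounts(MRobj, norm = FALSE))             # raw counts
head(MRcounts(MRobj, norm = TRUE, log = TRUE))  # CSS-normalized, log2 scale
```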
12.4.6 Calculate the Normalization Factors Using the calcNormFactors()
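The calcNormFactors() function returns each sample's cumulative-sum scaling factor. A sketch:

```r
# Scaling factors at the data-driven percentile
nf <- calcNormFactors(MRobj, p = cumNormStatFast(MRobj))
head(nf)
```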
12.4.7 Access the Library Sizes Using the libSize()
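The libSize() function returns each sample's depth of coverage. A sketch:

```r
head(libSize(MRobj))
```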
12.4.8 Save the Normalized Counts Using the exportMat()
The exportMat() function exports the normalized MRexperiment dataset as a matrix.
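A sketch; the output file name is arbitrary:

```r
# Export the CSS-normalized, log-transformed matrix as a tab-delimited file
exportMat(MRobj, norm = TRUE, log = TRUE, sep = "\t",
          file = "normalized_counts.tsv")
```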
12.4.9 Save the Sample Statistics Using the exportStats()
The exportStats() function exports a matrix of values for each sample, including various sample statistics: sample ids, the sample scaling factor, the quantile value, the number of identified features, and the library size (depth of coverage):
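A sketch; the output file name is arbitrary:

```r
# Export per-sample statistics (scaling factor, quantile, features, library size)
exportStats(MRobj, file = "sample_stats.tsv")
```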
12.4.10 Find Unique OTUs or Features
We can use the uniqueFeatures() function to find features absent from any number of classes. The returned table lists the feature ids, the number of positive samples, and the number of reads for each group. Thresholds for the required number of positive samples or reads are optional arguments:
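A sketch, using the Group labels and illustrative thresholds:

```r
cl <- pData(MRobj)$Group  # assumed column name
# Require at least 2 positive samples and 10 reads (thresholds for illustration)
head(uniqueFeatures(MRobj, cl, nsamples = 2, nreads = 10))
```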
12.4.11 Aggregate Taxa
Normalization is recommended at the OTU level. However, the count matrix (normalized or not) can be aggregated at a particular user-defined taxonomic level. Currently, there are two functions, aggregateByTaxonomy() and aggTax(), in the metagenomeSeq package that use the featureData information in the MRexperiment object to aggregate taxa; both aggregate a MRexperiment object or count matrix to a particular level. We first call the aggregateByTaxonomy() function on the MRexperiment object obj_Post to aggregate counts to the "Genus" level (a featureData column) using the default colSums() aggregating function:
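A sketch, assuming featureData contains a column named Genus:

```r
# Aggregate counts to the genus level using column sums
obj_genus <- aggregateByTaxonomy(obj_Post, lvl = "Genus", aggfun = colSums)
obj_genus
```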
We then call the aggTax() function on the MRexperiment object obj_Post to aggregate counts to the "Phylum" level (a featureData column) using the default colSums() aggregating function:
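A sketch, assuming a featureData column named Phylum:

```r
# aggTax() is the flexible wrapper; aggregate to the phylum level
obj_phylum <- aggTax(obj_Post, lvl = "Phylum", aggfun = colSums)
obj_phylum
```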
12.4.12 Aggregate Samples
Samples can also be aggregated using the phenoData information in the MRexperiment object. Currently, there are two functions, aggregateBySample() and aggSamp(), in the metagenomeSeq package that use the phenoData information in the MRexperiment object to aggregate a MRexperiment object or count matrix by a factor.
In the following commands, we call the aggregateBySample() function on the MRexperiment object obj_Post, using Group (a phenoData column name) to aggregate counts with the default rowMeans() aggregating function:
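A sketch, assuming the phenoData column Group:

```r
# Aggregate samples by the Group factor using row means
obj_byGroup <- aggregateBySample(obj_Post, fct = "Group", aggfun = rowMeans)
obj_byGroup
```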
In microbiome data analysis, data often need to be aggregated. The functions aggregateByTaxonomy(), aggregateBySample(), aggTax(), and aggSamp() are very flexible: they accept either a matrix with a vector of labels or a MRexperiment object with a vector of labels or a featureData (or phenoData) column name, and they can output either a matrix or a MRexperiment object.
12.5 Remarks on CSS Normalization, ZIG, and ZILN
The CSS method divides raw counts by the cumulative sum of counts up to a percentile determined using a data-driven approach; it is an adaptive extension of the quantile normalization method (Bullard et al. 2010). CSS normalization may have several advantages: (1) CSS may have superior performance for weighted metrics, although this is arguable (Paulson et al. 2013, 2014; Costea et al. 2014; Weiss et al. 2017). (2) CSS is similar to the log upper quartile (LUQ) method, but it is more flexible than LUQ because it allows a data-dependent threshold to determine each sample's quantile divisor (Weiss et al. 2017). (3) It was shown that CSS, along with GMPR, is overall more robust to sample-specific outlier OTUs than TSS and RNA-seq normalization methods (Chen et al. 2018). For CSS, it is crucial to ensure that the chosen quantile is appropriate, so that the normalization does not introduce artifacts into the data; the method assumes that the count distributions of all samples are roughly equivalent and independent of each other up to this quantile. The challenge, however, is that the percentiles on which CSS is based may not be determinable for microbiome datasets with highly variable counts (Chen et al. 2018).
1. The advantages of ZIG include the following: (1) It was shown that metagenomeSeq outperforms other widely used statistical methods in the field, including Metastats (White et al. 2009), LEfSe's Kruskal-Wallis test (Segata et al. 2011), and representative methods for RNA-seq analysis, edgeR (Robinson et al. 2010) and DESeq (Anders and Huber 2010), in terms of area under the curve (AUC) scores (Paulson et al. 2013); others also showed that metagenomeSeq has slightly higher power than most other DA methods (Lin and Peddada 2020b). (2) It yields a more precise biological interpretation of microbiome data (Paulson et al. 2013). (3) ZIG performs well when there is an adequate number of biological replicates (McMurdie and Holmes 2014).
However, the disadvantages of ZIG include a higher false discovery rate (FDR) (McMurdie and Holmes 2014; Mandal et al. 2015; Weiss et al. 2017), and the problem of FDR inflation worsens as the sample size or the effect size (i.e., fold change of mean absolute abundances) increases (Weiss et al. 2017; Lin and Peddada 2020a). This disadvantage is the most criticized in the microbiome literature; recently it was shown that ZIG is subject to FDR inflation even when the observed abundances are normalized by its built-in scaling method (CSS) (Lin and Peddada 2020b).
However, we should point out that some of the reported disadvantages of ZIG and ZILN were assessed from a compositional perspective, and the compositional approach is itself debated in the microbiome literature. Additionally, the FDR inflation of ZIG was assessed using rarefied data (McMurdie and Holmes 2014; Weiss et al. 2017), which highlights the difference between count-based approaches (e.g., ZIG) and compositional approaches. ZIG was developed for 16S count data, so it is unsurprising that it is not suitable for compositional or proportion data; ZIG and its CSS normalization require the original library sizes to capture the zero proportions. Rarefying biological count data omits available valid data and hence is statistically inadmissible (McMurdie and Holmes 2014). Therefore, criticizing ZIG based on rarefying is not a strong argument, because the use of rarefaction is itself arguable.
2. ZIG was developed with a log transformation of the read counts, using an empirical Bayes procedure to estimate the moderated variances. Reviewers have noted that it is not clear how to extend the empirical Bayes method to the longitudinal setting, and that it is not trivial to include random effects by extending these methods (Chen and Li 2016).
3. MetagenomeSeq was specifically developed to handle the high number of zero observations encountered in metagenomic data. However, inference is done after a log transformation (log2(yij + 1)) followed by a correction for zero inflation based on a Gaussian mixture model (Paulson et al. 2013). Thus, metagenomeSeq in effect moderates the feature-specific variance estimates (Smyth 2004) through a form of zero removal.
So far, most evaluations of metagenomeSeq's performance have been based on ZIG. Recently, one study demonstrated that the ZILN mixture model successfully controls the FDR below 5% in all simulations but suffers a severe loss of power (Lin and Peddada 2020a). Additional evaluations based on ZILN are needed to provide further information on the metagenomeSeq methods.
12.6 Summary
In this chapter, we focused on the metagenomeSeq methods, including the ZIG and ZILN mixture models, as well as additional statistical tests and useful functions in metagenomeSeq. We first introduced the TSS and CSS normalization methods; CSS is one critical component of the metagenomeSeq methods. We then described the ZIG and ZILN models. Next, we illustrated how to implement ZILN via the metagenomeSeq package. Currently, the ZILN model (the fitFeatureModel() function) is recommended over ZIG (the fitZig() function) in the metagenomeSeq manual. Following that, we illustrated some additional statistical tests available in metagenomeSeq: the log-normal permutation test of abundance, presence-absence testing of the proportion/odds, discovery odds ratio testing of the proportion of observed counts, and feature correlations. To facilitate the analysis of microbiome data using the metagenomeSeq methods, we also illustrated some useful functions in metagenomeSeq: accessing, subsetting, filtering, and merging the MRexperiment object; calling the normalized counts using cumNormMat() and MRcounts(); calculating the normalization factors using calcNormFactors(); accessing the library sizes using libSize(); saving the normalized counts using exportMat(); saving the sample statistics using exportStats(); finding unique OTUs or features; and aggregating taxa and samples. Finally, we provided some remarks on CSS, ZIG, and ZILN.
In Chap. 13, we will introduce zero-inflated beta models for microbiome data.